Identifying and Implementing Education Practices Supported by
Rigorous Evidence: A User Friendly Guide 

By Jon Baron, Coalition for Evidence-Based Policy 

This Guide seeks to provide assistance to educational practitioners in evaluating whether an edu- 
cational intervention is backed by rigorous evidence of effectiveness, and in implementing evi- 
dence-based interventions in their schools or classrooms. By intervention, we mean an educational 
practice, strategy, curriculum, or program. The Guide is organized in four parts: 

I. A description of the randomized controlled trial, and why it is a critical factor in
establishing “strong” evidence of an intervention’s effectiveness;

II. How to evaluate whether an intervention is backed by “strong” evidence of effectiveness; 

III. How to evaluate whether an intervention is backed by “possible” evidence of 
effectiveness; and 

IV. Important factors to consider when implementing an evidence-based intervention in your 
schools or classrooms. 

I. The randomized controlled trial: What it is, and why it is a critical factor in establishing 
“strong” evidence of an intervention’s effectiveness. 

Well-designed and implemented randomized controlled trials are considered the “gold standard” for 
evaluating an intervention’s effectiveness, in fields such as medicine, welfare and employment 
policy, and psychology. This section discusses what a randomized controlled trial is, and outlines
evidence indicating that such trials should play a similar role in education. 

A. Definition: Randomized controlled trials are studies that randomly assign individuals to an intervention
group or to a control group, in order to measure the effects of the intervention.

For example, suppose you want to test, in a randomized controlled trial, whether a new math
curriculum for third-graders is more effective than your school’s existing math curriculum for 
third-graders. You would randomly assign a large number of third-grade students to either an inter- 
vention group, which uses the new curriculum, or to a control group, which uses the existing 
curriculum. You would then measure the math achievement of both groups over time. The differ- 
ence in math achievement between the two groups would represent the effect of the new curricu- 
lum compared to the existing curriculum. 

In a variation on this basic concept, sometimes individuals are randomly assigned to two or 
more intervention groups as well as to a control group, in order to measure the effects of different 
interventions in one trial. Also, in some trials, entire classrooms, schools, or school districts -
rather than individual students - are randomly assigned to intervention and control groups.
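
The basic design described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the roster of 300 students, the function names, and the fixed seed are hypothetical and do not come from any particular study.

```python
import random
import statistics

def randomly_assign(student_ids, seed=0):
    """Shuffle the students and split them into two equal-sized groups."""
    rng = random.Random(seed)            # fixed seed so the split is reproducible
    shuffled = list(student_ids)
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    return shuffled[:half], shuffled[half:]   # (intervention, control)

def estimated_effect(scores, intervention, control):
    """Difference in mean outcome between the intervention and control groups."""
    mean_intervention = statistics.mean(scores[s] for s in intervention)
    mean_control = statistics.mean(scores[s] for s in control)
    return mean_intervention - mean_control

# Hypothetical roster of 300 third-graders, identified by number.
students = list(range(300))
intervention, control = randomly_assign(students)
```

Once math achievement has been measured for every student, passing a dictionary of scores to `estimated_effect` gives the difference between the two groups, which is the trial's estimate of the new curriculum's effect.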

B. The unique advantage of random assignment: It enables you to evaluate whether the intervention itself, as
opposed to other factors, causes the observed outcomes.

Specifically, the process of randomly assigning a large number of individuals to either an interven- 
tion or control group ensures, to a high degree of confidence, that there are no systematic differ- 
ences between the groups in any characteristics (observed and unobserved) except one - namely, 
the intervention group participates in the intervention, and the control group does not. Therefore
- assuming the trial is properly carried out (per the guidelines below) - the resulting difference in 
outcomes between the intervention and control groups can confidently be attributed to the inter- 
vention and not to other factors. 

C. There is persuasive evidence that the randomized controlled trial, when properly designed and implemented,
is superior to other study designs in measuring an intervention’s true effect.

1. A “pre-post” study examines whether participants in an intervention improve or regress during
the course of the intervention, and then attributes any such improvement or regression to the
intervention.
How to evaluate whether an educational intervention
is supported by rigorous evidence: An overview

Step 1. Is the intervention backed by "strong" evidence of effectiveness?

Quality of studies needed to establish "strong" evidence:

• Randomized controlled trials that are well-designed and implemented.

Quantity of evidence needed: Trials showing effectiveness in -

• Two or more typical school settings,

• Including a setting similar to that of your schools/classrooms.

"Strong" Evidence

Step 2. If the intervention is not backed by "strong" evidence, is it backed by
"possible" evidence of effectiveness?

Types of studies that can comprise "possible" evidence:

• Randomized controlled trials whose quality/quantity are good but fall short of
"strong" evidence; and/or

• Comparison group studies in which the intervention and comparison groups are
very closely matched in academic achievement, demographics, and other
characteristics.

Types of studies that do not comprise "possible" evidence:

• Pre-post studies.

• Comparison group studies in which the intervention and comparison
groups are not closely matched.

• "Meta-analyses" that include the results of such lower-quality studies.

Step 3. If the answers to both questions above are "no", one may conclude that the
intervention is not supported by meaningful evidence.

The problem with this type of study is that, without reference to a control group, it cannot answer
whether the participants’ improvement or decline would have occurred anyway, even without the 
intervention. This often leads to erroneous conclusions about the effectiveness of the intervention. 

Example: A randomized controlled trial of Even Start - a federal program designed to improve the literacy
of disadvantaged families - found that the program had no effect on improving the school readiness of
participating children at the 18-month follow-up. Specifically, there were no significant differences
between young children in the program and those in the control group on measures of school readiness
including the Peabody Picture Vocabulary Test (PPVT) and PreSchool Inventory.

If a pre-post design rather than a randomized design had been used in this study, the study would 
have concluded erroneously that the program was effective in increasing school readiness. This is 
because both the children in the program and those in the control group showed improvement in 
school readiness during the course of the program (e.g., both groups of children improved substantially
in their national percentile ranking on the PPVT). A pre-post study would have attributed the
participants’ improvement to the program whereas in fact it was the result of other factors, as 
evidenced by the equal improvement for children in the control group. 

Example: A randomized controlled trial of the Summer Training and Education Program - a Labor Department
pilot program that provided summer remediation and work experience for disadvantaged teenagers
- found that the program’s short-term impact on participants’ reading ability was positive. Specifically, while
the reading ability of the control group members eroded by a full grade-level during the first summer of the
program, the reading ability of participants in the program eroded by only a half grade-level.

If a pre-post design rather than a randomized design had been used in this study, the study would 
have concluded erroneously that the program was harmful. That is, the study would have found a 
decline in participants’ reading ability and attributed it to the program. In fact, however, the
participants’ decline in reading ability was the result of other factors - such as the natural erosion of
reading ability during the summer vacation months - as evidenced by the even greater decline for 
members of the control group. 
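
The arithmetic behind this example can be laid out as a short sketch. The two grade-level changes are the ones reported above; the variable names are ours, and treating each design's estimate as a simple subtraction is a simplification for illustration.

```python
# Change in reading ability over the first summer, in grade levels,
# as reported in the Summer Training and Education Program example.
participant_change = -0.5   # program participants
control_change = -1.0       # control group members

# A pre-post design sees only the participants' own change.
pre_post_estimate = participant_change

# A randomized design compares that change with the control group's.
randomized_estimate = participant_change - control_change

print(pre_post_estimate)    # -0.5: the program looks harmful
print(randomized_estimate)  # 0.5: participants ended up half a grade level ahead
```

The same subtraction applied to the Even Start example, where both groups improved equally, would give a randomized estimate of zero.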

2. The most common “comparison group” study designs (also known as “quasi-experimental” 
designs) also lead to erroneous conclusions in many cases. 

A. Definition: A “comparison group” study compares outcomes for intervention participants
with outcomes for a comparison group chosen through methods other than randomization.

The following example illustrates the basic concept of this design. Suppose you want to use a
comparison-group study to test whether a new mathematics curriculum is effective. You would
compare the math performance of students who participate in the new curriculum (“intervention group”)
with the performance of a “comparison group” of students, chosen through methods other than
randomization, who do not participate in the curriculum. The comparison group might be students in
neighboring classrooms or schools that don’t use the curriculum, or students in the same grade 
and socioeconomic status selected from state or national survey data. The difference in math per- 
formance between the intervention and comparison groups following the intervention would repre- 
sent the estimated effect of the curriculum. 

Some comparison-group studies use statistical techniques to create a comparison group that is 
matched with the intervention group in socioeconomic and other characteristics, or to otherwise 
adjust for differences between the two groups that might lead to inaccurate estimates of the 
intervention’s effect. 

B. There is persuasive evidence that the most common comparison-group designs produce
erroneous conclusions in a sizeable number of cases.

A number of careful investigations have been carried out - in the areas of school dropout
prevention, K-3 class-size reduction, and welfare and employment policy - to examine whether and
under what circumstances comparison-group designs can replicate the results of randomized
controlled trials. These investigations first compare participants in a particular intervention with a
control group, selected through randomization, in order to estimate the intervention’s impact in a
randomized controlled trial. Then the same intervention participants are compared with a
comparison group selected through methods other than randomization, in order to estimate the


The Journal for Vocational Special Needs Education

intervention’s impact in a comparison-group design. Any systematic difference between the two 
estimates represents the inaccuracy produced by the comparison-group design. 

These investigations have shown that most comparison-group designs in education and other 
areas produce inaccurate estimates of an intervention’s effect. This is because of unobservable 
differences between the members of the two groups that differentially affect their outcomes. For 
example, if intervention participants self-select into the intervention group, they may
be more motivated to succeed than their control-group counterparts. Their motivation - rather 
than the intervention - may then lead to their superior outcomes. In a sizeable number of cases, 
the inaccuracy produced by the comparison-group designs is large enough to result in erroneous 
overall conclusions about whether the intervention is effective, ineffective, or harmful. 
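
A small simulation can make the self-selection problem concrete. It is a sketch under invented assumptions: the outcome model, the score scale, and the rule that more motivated students opt in are all hypothetical, and the intervention is deliberately given a true effect of zero.

```python
import random
import statistics

def simulate(n=10_000, seed=1):
    """Compare a self-selected comparison-group estimate with a randomized one."""
    rng = random.Random(seed)
    true_effect = 0.0                        # assumption: the intervention does nothing
    motivation = [rng.gauss(0, 1) for _ in range(n)]
    noise = [rng.gauss(0, 5) for _ in range(n)]

    def outcome(i, treated):
        # Hypothetical model: scores depend on motivation, not on the intervention.
        return 50 + 5 * motivation[i] + noise[i] + (true_effect if treated else 0.0)

    # Comparison-group design: the more motivated students opt in.
    opted_in = [motivation[i] > 0 for i in range(n)]
    # Randomized design: a coin toss decides, regardless of motivation.
    coin = [rng.random() < 0.5 for _ in range(n)]

    def estimate(assignment):
        treated = [outcome(i, True) for i in range(n) if assignment[i]]
        untreated = [outcome(i, False) for i in range(n) if not assignment[i]]
        return statistics.mean(treated) - statistics.mean(untreated)

    return estimate(opted_in), estimate(coin)

comparison_estimate, randomized_estimate = simulate()
```

Under these assumptions the comparison-group estimate comes out well above zero, because it picks up the motivation gap, while the randomized estimate stays close to the true effect of zero.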

Example from medicine. Over the past 30 years, more than two dozen comparison-group studies have 
found hormone replacement therapy for postmenopausal women to be effective in reducing the women’s 
risk of coronary heart disease, by about 35-50 percent. But when hormone therapy was finally evaluated 
in two large-scale randomized controlled trials - medicine’s “gold standard” - it was actually found to do 
the opposite: it increased the risk of heart disease, as well as stroke and breast cancer.

Medicine contains many other important examples of interventions whose effect as measured 
in comparison-group studies was subsequently contradicted by well-designed randomized controlled 
trials. If randomized controlled trials in these cases had never been carried out and the compari- 
son-group results had been relied on instead, the result would have been needless death or serious 
illness for millions of people. This is why the Food and Drug Administration and National Institutes 
of Health generally use the randomized controlled trial as the final arbiter of which medical inter- 
ventions are effective and which are not. 

3. Well-matched comparison-group studies can be valuable in generating hypotheses about “what
works,” but their results need to be confirmed in randomized controlled trials.

The investigations, discussed above, that compare comparison-group designs with randomized con- 
trolled trials generally support the value of comparison-group designs in which the comparison 
group is very closely matched with the intervention group in prior test scores, demographics, time 
period in which they are studied, and methods used to collect outcome data. 

As discussed in section III of this Guide, we believe that such well-matched studies can play a 
valuable role in education, as they have in medicine and other fields, in establishing “possible” 
evidence of an intervention’s effectiveness, and thereby generating hypotheses that merit confirmation
in randomized controlled trials. But the evidence cautions strongly against using even the
most well-matched comparison-group studies as a final arbiter of what is effective and what is not, 
or as a reliable guide to the strength of the effect. 

D. Thus, we believe there are compelling reasons why randomized controlled trials are a critical factor in
establishing “strong” evidence of an intervention’s effectiveness.

II. How to evaluate whether an intervention is backed by “strong” evidence of effectiveness.

This section discusses how to evaluate whether an intervention is backed by “strong” evidence 
that it will improve educational outcomes in your schools or classrooms. Specifically, it discusses 
both the quality and quantity of studies needed to establish such evidence. 

A. Quality of evidence needed to establish “strong” evidence of effectiveness: Randomized controlled trials that
are well-designed and implemented.

As discussed in section I, randomized controlled trials are a critical factor in establishing “strong” 
evidence of an intervention’s effectiveness. Of course, such trials must also be well-designed and 
implemented in order to constitute strong evidence. Below is an outline of key items to look for
when reviewing a randomized controlled trial of an educational intervention, to see whether the 
trial was well-designed and implemented. It is meant as a discussion of general principles, rather 
than as an exhaustive list of the features of such trials. 
Key items to look for in the study’s description of the intervention
and the random assignment process 

1. The study should clearly describe (i) the intervention, including who administered it, who received it,
and what it cost; (ii) how the intervention differed from what the control group received; and (iii) the
logic of how the intervention is supposed to affect outcomes.

Example. A randomized controlled trial of a one-on-one tutoring program for beginning readers should
discuss such items as: 

• who conducted the tutoring (e.g., certified teachers, paraprofessionals, or undergraduate
volunteers);

• what training they received in how to tutor;

• what curriculum they used to tutor, and other key features of the tutoring sessions (e.g.,
daily 20-minute sessions over a period of six months);

• the age, reading achievement levels, and other relevant characteristics of the tutored
students and controls;

• the cost of the tutoring intervention per student;

• the reading instruction received by the students in the control group (e.g., the school’s pre-existing
reading program); and

• the logic by which tutoring is supposed to improve reading outcomes.

2. Be alert to any indication that the random assignment process may have been compromised.

For example, did any individuals randomly assigned to the control group subsequently cross over to 
the intervention group? Or did individuals unhappy with their prospective assignment to either the 
intervention or control group have an opportunity to delay their entry into the study until another 
opportunity arose for assignment to the preferred group? Such self-selection of individuals into 
their preferred groups undermines the random assignment process, and may well lead to inaccurate
estimates of the intervention’s effects. 

Ideally, a study should describe the method of random assignment it used (e.g., coin toss or 
lottery), and what steps were taken to prevent undermining (e.g., asking an objective third party to 
administer the random assignment process). In reality, few studies - even well-designed trials - do 
this. But we recommend that you be alert to any indication that the random assignment process 
was compromised. 

3. The study should provide data showing that there were no systematic differences between the
intervention and control groups before the intervention.

As discussed above, the random assignment process ensures, to a high degree of confidence, that 
there are no systematic differences between the characteristics of the intervention and control 
groups prior to the intervention. 


Key items to look for in the study’s collection of outcome data 

4. The study should use outcome measures that are “valid” - i.e., that accurately measure the true
outcomes that the intervention is designed to affect. Specifically:

• to test academic achievement outcomes (e.g., reading/math skills), a study should use tests
whose ability to accurately measure true skill levels is well-established (for example, the
Woodcock-Johnson Psychoeducational Battery, the Stanford Achievement Test, etc.).

• wherever possible, a study should use objective, “real-world” measures of the outcomes that
the intervention is designed to affect (e.g., for a delinquency prevention program, the
students’ official suspensions from school).

• if outcomes are measured through interviews or observation, the interviewers/observers
preferably should be kept unaware of who is in the intervention and control groups.

Such “blinding” of the interviewers/observers, where possible, helps protect against the
possibility that any bias they may have (e.g., as proponents of the intervention) could influence their
outcome measurements. Blinding would be appropriate, for example, in a study of a violence
prevention program for elementary school students, where an outcome measure is the incidence of
hitting on the playground as detected by an adult observer.

When study participants are asked to “self-report” outcomes, their reports should, if possible, be 
corroborated by independent and/or objective measures. 

For instance, when participants in a substance-abuse or violence prevention program are asked 
to self-report their drug or tobacco use or criminal behavior, they tend to under-report such unde- 
sirable behaviors. In some cases, this may lead to inaccurate study results, depending on whether 
the intervention and control groups under-report by different amounts. 

Thus, studies that use such self-reported outcomes should, if possible, corroborate them with 
other measures (e.g., saliva thiocyanate tests for smoking, official arrest data, third-party
observations).

5. The percent of study participants that the study has lost track of when collecting outcome data should
be small, and should not differ between the intervention and control groups.

A general guideline is that the study should lose track of fewer than 25% of the individuals
originally randomized - the fewer lost, the better. This is sometimes referred to as the requirement for
“low attrition.” (Studies that choose to follow only a representative subsample of the randomized
individuals should lose track of fewer than 25% of the subsample.)

Furthermore, the percentage of subjects lost track of should be approximately the same for the 
intervention and the control groups. This is because differential losses between the two groups can 
create systematic differences between the two groups, and thereby lead to inaccurate estimates of 
the intervention’s effect. This is sometimes referred to as the requirement for “no differential 
attrition.” 
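
As a rough illustration, both attrition checks can be computed from four head counts. The function name and the counts below are hypothetical. The 25% threshold is the guideline stated above; the Guide gives no numeric cutoff for differential attrition, so the sketch only reports the gap between groups and leaves the judgment to the reader.

```python
def attrition_report(randomized, followed_up, threshold=0.25):
    """Summarize attrition from head counts keyed by group name.

    randomized:  number of individuals originally randomized, per group
    followed_up: number with outcome data collected, per group
    """
    rates = {g: (randomized[g] - followed_up[g]) / randomized[g] for g in randomized}
    total_randomized = sum(randomized.values())
    total_lost = total_randomized - sum(followed_up.values())
    overall = total_lost / total_randomized
    return {
        "by_group": rates,
        "overall": overall,
        "low_attrition": overall < threshold,   # the 25% guideline
        "differential": abs(rates["intervention"] - rates["control"]),
    }

# Hypothetical study: 150 students randomized to each group.
report = attrition_report(
    randomized={"intervention": 150, "control": 150},
    followed_up={"intervention": 135, "control": 120},
)
```

In this made-up case the study meets the overall guideline (15% lost), but it lost twice as large a share of the control group as of the intervention group, which is the kind of differential loss the paragraph above warns about.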

6. The study should collect and report outcome data even for those members of the intervention
group who don’t participate in or complete the intervention.

This is sometimes referred to as the study’s use of an “intention-to-treat” approach, the importance 
of which is best illustrated with an example. 

Example. Consider a randomized controlled trial of a school voucher program, in which students from
disadvantaged backgrounds are randomly assigned to an intervention group - whose members are
offered vouchers to attend private school - or to a control group that does not receive voucher offers. It’s
likely that some of the students in the intervention group will not accept their voucher offers and will
choose instead to remain in their existing schools. Suppose that, as may well be the case, these students
as a group are less motivated to succeed than their counterparts who accept the offer. If the trial then
drops the students not accepting the offer from the intervention group, leaving the more motivated
students, it would create a systematic difference between the intervention and control groups - namely,
motivation level. Thus the study may well over-estimate the voucher program’s effect on educational
success, erroneously attributing a superior outcome for the intervention group to the vouchers when in fact it
was due to the difference in motivation.

Therefore, the study should collect outcome data for all the individuals randomly assigned to the 
intervention group, whether they participated in the intervention or not, and should use all such data in 
estimating the intervention’s effect. The study should also report on how many of the individuals 
assigned to the intervention group actually participated in the intervention. 
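
The voucher example can be worked through with made-up numbers. Everything in this sketch is hypothetical: the scores, the 60/40 split between students who accept and decline, and the assumption that the vouchers have no true effect.

```python
import statistics

# Hypothetical test scores; the vouchers are assumed to have no true effect.
accepted = [70] * 60               # offered a voucher and used it (more motivated)
declined = [60] * 40               # offered a voucher but stayed put (less motivated)
control = [70] * 60 + [60] * 40    # same mix of students, no voucher offer

# Flawed analysis: drop the students who declined the offer.
completers_only = statistics.mean(accepted) - statistics.mean(control)

# Intention-to-treat: keep everyone who was randomly assigned.
intention_to_treat = statistics.mean(accepted + declined) - statistics.mean(control)

# Share of the intervention group that actually participated.
participation_rate = len(accepted) / (len(accepted) + len(declined))
```

With these numbers the flawed analysis shows a 4-point advantage that is entirely due to motivation, while the intention-to-treat estimate is zero, matching the assumed true effect.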

7. The study should preferably obtain data on long-term outcomes of the intervention, so that you can
judge whether the intervention’s effects were sustained over time.

This is important because the effect of many interventions diminishes substantially within 2-3
years after the intervention ends. This has been demonstrated in randomized controlled trials in
diverse areas such as early reading, school-based substance-abuse prevention, prevention of child- 
hood depression, and welfare-to-work and employment. In most cases, it is the longer-term effect, 
rather than the immediate effect, that is of greatest practical and policy significance. 


Key items to look for in the study’s reporting of results 

8. If the study claims that the intervention improves one or more outcomes, it should report (i) the size of
the effect, and (ii) statistical tests showing the effect is unlikely to be due to chance.

Specifically, the study should report the size of the difference in outcomes between the interven- 
tion and control groups. It should report the results of tests showing the difference is “statistically 
significant” at conventional levels - generally the .05 level. Such a finding means that there is only
a 1 in 20 probability that the difference could have occurred by chance if the intervention’s true 
effect is zero. 

A. In order to obtain such a finding of statistically significant effects, a study usually needs to have a
relatively large sample size.

A rough rule of thumb is that a sample size of at least 300 students (150 in the intervention group
and 150 in the control group) is needed to obtain a finding of statistical significance for an
intervention that is modestly effective. If schools or classrooms, rather than individual students, are
randomized, a minimum sample size of 50 to 60 schools or classrooms (25-30 in the intervention group
and 25-30 in the control group) is needed to obtain such a finding. (This rule of thumb assumes that
the researchers choose a sample of individuals or schools/classrooms that do not differ widely in
initial achievement levels.) If an intervention is highly effective, smaller sample sizes than this
may be able to generate a finding of statistical significance.

If the study seeks to examine the intervention’s effect on particular subgroups within the over- 
all sample (e.g., Hispanic students), larger sample sizes than those above may be needed to gener- 
ate a finding of statistical significance for the subgroups. 

In general, larger sample sizes are better than smaller sample sizes, because they provide 
greater confidence that any difference in outcomes between the intervention and control groups is 
due to the intervention rather than chance. 
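
One way to see how sample size relates to the size of effect a study can detect is a standard power approximation. The sketch below rests on assumptions the Guide does not state: a two-sided test at the .05 level, an 80% chance of detecting the effect, a normal approximation, individually randomized students, and no adjustment for baseline scores. The result is expressed in standard deviations of the outcome, and the function name is ours.

```python
from math import sqrt
from statistics import NormalDist

def minimum_detectable_effect(n_intervention, n_control, alpha=0.05, power=0.80):
    """Smallest true effect (in standard deviations) likely to reach significance."""
    z = NormalDist()                       # standard normal distribution
    z_alpha = z.inv_cdf(1 - alpha / 2)     # about 1.96 for a two-sided .05 test
    z_power = z.inv_cdf(power)             # about 0.84 for 80% power
    return (z_alpha + z_power) * sqrt(1 / n_intervention + 1 / n_control)

mde_300 = minimum_detectable_effect(150, 150)   # about 0.32 standard deviations
mde_100 = minimum_detectable_effect(50, 50)     # about 0.56 standard deviations
```

Under these assumptions, a study of 300 students can reliably detect an effect of roughly a third of a standard deviation, while a study of 100 students would miss anything much smaller than half a standard deviation. This is consistent with the point above that smaller samples suffice only for highly effective interventions.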

B. If the study randomizes groups (e.g., schools) rather than individuals, the sample size that the study uses
in tests for statistical significance should be the number of groups rather than the number of individuals
in those groups.

Occasionally, a study will erroneously use the number of individuals as its sample size, and thus 
generate false findings of statistical significance. 

Example. If a study randomly assigns two schools to an intervention group and two schools to a control 
group, the sample size that the study should use in tests for statistical significance is just four, regardless 
of how many hundreds of students are in the schools. (And it is very unlikely that such a small study 
could obtain a finding of statistical significance). 
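
A randomization (permutation) test makes the four-school example concrete. This is a sketch with hypothetical school names and school-level mean scores, using the difference in mean scores as the test statistic; it is not a method the Guide itself prescribes.

```python
from itertools import combinations

def permutation_p_value(school_means, intervention_schools):
    """Two-sided randomization test treating each school as one unit."""
    schools = list(school_means)
    k = len(intervention_schools)

    def mean_difference(chosen):
        treated = [school_means[s] for s in chosen]
        others = [school_means[s] for s in schools if s not in chosen]
        return sum(treated) / len(treated) - sum(others) / len(others)

    observed = abs(mean_difference(intervention_schools))
    # Every way the k intervention slots could have been assigned.
    assignments = list(combinations(schools, k))
    # Small tolerance guards against floating-point ties.
    as_extreme = sum(abs(mean_difference(a)) >= observed - 1e-12 for a in assignments)
    return as_extreme / len(assignments)

# Hypothetical school-level mean scores; schools A and B got the intervention.
means = {"A": 80.0, "B": 78.0, "C": 60.0, "D": 62.0}
p = permutation_p_value(means, ("A", "B"))
```

There are only six ways to pick two of four schools, so even with the largest possible gap between groups the two-sided p-value here is 2/6, or about 0.33, far from the .05 level, no matter how many students each school enrolls.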

C. The study should preferably report the size of the intervention’s effects in easily understandable,
real-world terms (e.g., an improvement in reading skill by two grade levels, a 20 percent reduction in
weekly use of illicit drugs, a 20 percent increase in high school graduation rates).

It is important for a study to report the size of the intervention’s effects in this way, in addition to 
whether the effects are statistically significant, so that you (the reader) can judge their educa- 
tional importance. For example, it is possible that a study with a large sample size could show 
effects that are statistically significant but so small that they have little practical or policy
significance (e.g., a 2-point increase in SAT scores). Unfortunately, some studies report only whether the
intervention’s effects are statistically significant, and not their magnitude.

Some studies describe the size of the intervention’s effects in “standardized effect sizes.” A
full discussion of this concept is beyond the scope of this Guide. We merely comment that
standardized effect sizes may not accurately convey the educational importance of an intervention, and,
when used, should preferably be translated into understandable, real-world terms like those used
above.

9. A study’s claim that the intervention’s effect on a subgroup (e.g., Hispanic students) is different from
its effect on the overall population in the study should be treated with caution.

Specifically, we recommend that you look for corroborating evidence of such subgroup effects in 
other studies before accepting them as valid. 

This is because a study will sometimes show different effects for different subgroups just by
chance, particularly when the researchers examine a large number of subgroups and/or the
subgroups contain a small number of individuals. For example, even if an intervention’s true effect is
the same on all subgroups, we would expect a study’s analysis of 20 subgroups to “demonstrate” a
different effect on one of those subgroups just by chance (at conventional levels of statistical
significance). Thus, studies that engage in a post-hoc search for different subgroup effects (as some do)
will sometimes turn up spurious effects rather than legitimate ones.
Example. In a large randomized controlled trial of aspirin for the emergency treatment of heart
attacks, aspirin was found to be highly effective, resulting in a 23% reduction in vascular deaths at 
the one-month follow-up. To illustrate the unreliability of subgroup analyses, these overall results 
were subdivided by the patients’ astrological birth signs into 12 subgroups. Aspirin’s effects were 
similar in most subgroups to those for the whole population. However, for two of the subgroups, Libra 
and Gemini, aspirin appeared to have no effect in reducing mortality. Clearly it would be wrong to 
conclude from this analysis that heart attack patients born under the astrological signs of Libra 
and Gemini do not benefit from aspirin. 
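
The chance of a spurious subgroup finding can be estimated with a short calculation. It assumes, for simplicity, that each subgroup test is independent and is run at the .05 level. Real subgroup analyses rarely satisfy this exactly, so the figures are rough guides rather than exact probabilities.

```python
alpha = 0.05   # conventional significance level

def chance_of_spurious_finding(n_subgroups, alpha=alpha):
    """Probability that at least one of n independent tests is significant by chance."""
    return 1 - (1 - alpha) ** n_subgroups

expected_false_positives = alpha * 20        # one subgroup in twenty, on average
p_twenty = chance_of_spurious_finding(20)    # roughly 0.64
p_twelve = chance_of_spurious_finding(12)    # roughly 0.46, e.g. twelve birth signs
```

Under these assumptions, a study that examines 20 subgroups has close to a two-in-three chance of turning up at least one apparent subgroup difference even when none exists.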

10. The study should report the intervention’s effects on all the outcomes that the study measured, not
just those for which there is a positive effect.

This is because if a study measures a large number of outcomes, it may, by chance alone, find 
positive (and statistically-significant) effects on one or a few of those outcomes. Thus, the study 
should report the intervention’s effects on all measured outcomes so that you can judge whether 
the positive effects are the exception or the pattern. 

B. Quantity of evidence needed to establish “strong” evidence of effectiveness.

1. For reasons set out below, we believe “strong” evidence of effectiveness requires:

(i) that the intervention be demonstrated effective, through well-designed randomized controlled trials,
in more than one site of implementation; and

(ii) that these sites be typical school or community settings, such as public school classrooms taught by
regular teachers.

Typical settings would not include, for example, specialized classrooms set up and taught by
researchers for purposes of the study.

Such a demonstration of effectiveness may require more than one randomized controlled trial 
of the intervention, or one large trial with more than one implementation site. 

2. In addition, the trials should demonstrate the intervention’s effectiveness in school settings similar to
yours, before you can be confident it will work in your schools and classrooms.

For example, if you are considering implementing an intervention in a large inner-city public 
school serving primarily minority students, you should look for randomized controlled trials demon- 
strating the intervention’s effectiveness in similar settings. Randomized controlled trials demon- 
strating its effectiveness in a white, suburban population do not constitute strong evidence that it 
will work in your school. 

3. Main reasons why a demonstration of effectiveness in more than one site is needed: 

• A single finding of effectiveness can sometimes occur by chance alone. For example, even if
all educational interventions tested in randomized controlled trials were ineffective, we
would expect 1 in 20 of those trials to “demonstrate” effectiveness by chance alone at
conventional levels of statistical significance.

• The results of a trial in any one site may be dependent on site-specific factors and thus may
not be generalizable to other sites. It is possible, for instance, that an intervention may be
highly effective in a school with an unusually talented individual managing the details of
implementation, but would not be effective in another school with other individuals
managing the detailed implementation.

Example. Two multi-site randomized controlled trials of the Quantum Opportunity Program - a commu- 
nity-based program for disadvantaged high school students providing academic assistance, college and 
career planning, community service and work experiences, and other sewices - have found that the program’s 
effects vary greatly among the various program sites. A few sites - including the original program site 
(Philadelphia) - produced sizeable effects on participants’ academic and/or career outcomes, whereas
many sites had little or no effect on the same outcomes. Thus, the program’s effects appear to be highly 
dependent on site-specific factors, and it is not clear that its success can be widely replicated. 


4. Pharmaceutical medicine provides an important precedent for the concept that “strong” evidence
requires a showing of effectiveness in more than one instance.

Specifically, the Food and Drug Administration (FDA) usually requires that a new pharmaceutical 
drug or medical device be shown effective in more than one randomized controlled trial before the 
FDA will grant it a license to be marketed. The FDA’s reasons for this policy are similar to those
discussed above. 

III. How to evaluate whether an intervention is backed by “possible” evidence of 
effectiveness. 

Because well-designed and implemented randomized controlled trials are not very common in edu- 
cation, the evidence supporting an intervention frequently falls short of the above criteria for “strong” 
evidence of effectiveness in one or more respects. For example, the supporting evidence may con- 
sist of: 

•Only nonrandomized studies; 

•Only one well-designed randomized controlled trial showing the intervention’s effectiveness at 
a single site; 

•Randomized controlled trials whose design and implementation contain one or more flaws 
noted above (e.g., high attrition); 

•Randomized controlled trials showing the intervention’s effectiveness as implemented by re- 
searchers in a laboratory-like setting, rather than in a typical school or community setting; 
or 

•Randomized controlled trials showing the intervention’s effectiveness for students with different
academic skills and socioeconomic backgrounds than the students in your schools or
classrooms.

Whether an intervention not supported by “strong” evidence is nevertheless supported by “possible”
evidence of effectiveness (as opposed to no meaningful evidence of effectiveness) is a judgment
call that depends, for example, on the extent of the flaws in the randomized controlled trials of
the intervention and the quality of any nonrandomized studies that have been done. While this 
Guide cannot foresee and provide advice on all possible scenarios of evidence, it offers in this 
section a few factors to consider in evaluating whether an intervention not supported by “strong” 
evidence is nevertheless supported by “possible” evidence. 

A. Circumstances in which a comparison-group study can constitute “possible” evidence of effectiveness: 

1. The study’s intervention and comparison groups should be very closely matched in academic achievement
levels, demographics, and other characteristics prior to the intervention.

The investigations, discussed in section I, that compare comparison-group designs with randomized
controlled trials generally support the value of comparison-group designs in which the comparison
group is very closely matched with the intervention group. In the context of education studies,
the two groups should be matched closely in characteristics including:

•Prior test scores and other measures of academic achievement (preferably, the same mea- 
sures that the study will use to evaluate outcomes for the two groups); 

•Demographic characteristics, such as age, sex, ethnicity, poverty level, parents’ educational 
attainment, and single or two-parent family background; 

•Time period in which the two groups are studied (e.g., the two groups are children entering 
kindergarten in the same year as opposed to sequential years); and

•Methods used to collect outcome data (e.g., the same test of reading skills administered in the
same way to both groups). 

These investigations have also found that when the intervention and comparison groups differ 
in such characteristics, the study is unlikely to generate accurate results even when statistical 
techniques are then used to adjust for these differences in estimating the intervention’s effects.
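
One way a reader might check how closely two groups are matched on a prior measure is a standardized difference in means. The Python sketch below is an illustration under stated assumptions, not a method prescribed by this Guide: the scores are invented, and the function name is hypothetical.

```python
from statistics import mean, stdev


def standardized_difference(group_a, group_b):
    """Difference in means divided by the pooled standard deviation."""
    pooled_sd = ((stdev(group_a) ** 2 + stdev(group_b) ** 2) / 2) ** 0.5
    return (mean(group_a) - mean(group_b)) / pooled_sd


# Hypothetical prior-year reading scores (invented for illustration).
intervention = [48, 52, 55, 50, 47, 53, 51, 49]
well_matched = [49, 51, 54, 50, 46, 53, 52, 48]
poorly_matched = [58, 62, 65, 60, 57, 63, 61, 59]

if __name__ == "__main__":
    for label, comparison in [("well matched", well_matched),
                              ("poorly matched", poorly_matched)]:
        d = standardized_difference(intervention, comparison)
        print(f"{label}: standardized difference = {d:.2f}")
```

A value near zero suggests the groups started out similar on that measure. A large value is a warning sign that, as noted above, later statistical adjustment is unlikely to repair.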

2. The comparison group should not be comprised of individuals who had the option to participate in the
intervention but declined.

This is because individuals choosing not to participate in an intervention may differ systematically 
in their level of motivation and other important characteristics from the individuals who do choose 
to participate. The difference in motivation (or other characteristics) may itself lead to different
outcomes for the two groups, and thus contaminate the study’s estimates of the intervention’s 
effects. 

Therefore, the comparison group should be comprised of individuals who did not have the option 
to participate in the intervention, rather than individuals who had the option but declined. 

3. The study should preferably choose the intervention/comparison groups and outcome measures
“prospectively” - that is, before the intervention is administered.

This is because if the groups and outcome measures are chosen by the researchers after the
intervention is administered (“retrospectively”), the researchers may consciously or unconsciously 
select groups and outcome measures so as to generate their desired results. Furthermore, it is 
often difficult or impossible for the reader of the study to determine whether the researchers did so. 

Prospective comparison-group studies are, like randomized controlled trials, much less susceptible
to this problem. In the words of the director of drug evaluation for the Food and Drug Administration,
“The great thing about a [randomized controlled trial or prospective comparison-group
study] is that, within limits, you don’t have to believe anybody or trust anybody. The planning for 
[the study] is prospective; they’ve written the protocol before they’ve done the study, and any devia- 
tion that you introduce later is completely visible.” By contrast, in a retrospective study, “you al- 
ways wonder how many ways they cut the data. It’s very hard to be reassured, because there are no 
rules for doing it.”

4. The study should meet the guidelines set out in section II for a well-designed randomized controlled trial
(other than guideline 2 concerning the random-assignment process).

That is, the study should use valid outcome measures, have low attrition, report tests for statistical 
significance, and so on. 

B. Studies that do not meet the threshold for “possible” evidence of effectiveness:

1. Pre-post studies, which often produce erroneous results, as discussed in section I. 

2. Comparison-group studies in which the intervention and comparison groups are not well-matched.

As discussed in section I, such studies also produce erroneous results in many cases, even when 
statistical techniques are used to adjust for differences between the two groups. 

Examples. As reported in Education Week, several comparison-group studies have been carried out to 
evaluate the effect of “high-stakes testing” - i.e., state-level policies in which student test scores are used
to determine various consequences, such as whether the students graduate or are promoted to the next 
grade, whether their teachers are awarded bonuses or whether their school is taken over by the state. 
These studies compare changes in test scores and dropout rates for students in states with high-stakes 
testing (the intervention group) to those for students in other states (the comparison groups). Because 
students in different states differ in many characteristics, such as demographics and initial levels of 
academic achievement, it is unlikely that these studies provide accurate measures of the effects of high- 
stakes testing. It is not surprising that these studies reach differing conclusions about the effects of such 
testing. 

3. “Meta-analyses” that combine the results of individual studies that do not themselves meet the
threshold for “possible” evidence.

Meta-analysis is a quantitative technique for combining the results of individual studies, a full
discussion of which is beyond the scope of this Guide. We merely note that when meta-analysis is 
used to combine studies that themselves may generate erroneous results - such as randomized 
controlled trials with significant flaws, poorly-matched comparison group studies, and pre-post stud- 
ies - it will often produce erroneous results as well. 

Example. A meta-analysis combining the results of many nonrandomized studies of hormone replace- 
ment therapy found that such therapy significantly lowered the risk of coronary heart disease. But, as 
noted earlier, when hormone therapy was subsequently evaluated in two large-scale randomized con- 
trolled trials, it was actually found to do the opposite - namely, it increased the risk of coronary disease. 
The meta-analysis merely reflected the inaccurate results of the individual studies, producing more
precise, but still erroneous, estimates of the therapy’s effect.
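
The point that pooling flawed studies yields a more precise but still erroneous estimate can be sketched numerically. The Python example below uses inverse-variance (fixed-effect) pooling, one common way of combining studies; it is not necessarily the method used in the meta-analysis described above, and all numbers and names are hypothetical.

```python
from math import sqrt


def pool_fixed_effect(estimates, std_errors):
    """Inverse-variance weighted average and its standard error."""
    weights = [1 / se ** 2 for se in std_errors]
    pooled = sum(w * e for w, e in zip(weights, estimates)) / sum(weights)
    pooled_se = sqrt(1 / sum(weights))
    return pooled, pooled_se


# Hypothetical numbers: the true effect is 0.0, but every study shares
# a bias of roughly +0.30 (e.g., from poorly matched comparison groups).
TRUE_EFFECT = 0.0
biased_estimates = [0.25, 0.35, 0.30, 0.28, 0.32]
std_errors = [0.10, 0.10, 0.10, 0.10, 0.10]

if __name__ == "__main__":
    pooled, pooled_se = pool_fixed_effect(biased_estimates, std_errors)
    print(f"pooled estimate = {pooled:.2f}, standard error = {pooled_se:.3f}")
    print(f"distance from true effect = {abs(pooled - TRUE_EFFECT):.2f}")
```

In this sketch the pooled standard error is smaller than that of any single study, yet the pooled estimate stays as far from the true effect as the individual studies were: combining studies reduces random error, not shared bias.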

IV. Important factors to consider when implementing an evidence-based intervention in your
schools or classrooms. 

A. Whether an evidence-based intervention will have a positive effect in your schools or classrooms may
depend critically on your adhering closely to the details of its implementation.

The importance of adhering to the details of an evidence-based intervention when implementing it 
in your schools or classrooms is often not fully appreciated. Details of implementation can sometimes
make a major difference in the intervention’s effects, as the following examples illustrate.

Example. The Tennessee Class-Size Experiment - a large, multi-site randomized controlled trial in- 
volving 12,000 students - showed that significantly reduced class size for public school students in 
grades K-3 had positive effects on educational outcomes. For example, the average student in the small 
classes scored higher on the Stanford Achievement Test in reading and math than about 60% of the 
students in the regular-sized classes, and this effect diminished only slightly at the fifth-grade follow- 
up. 

Based largely on these results, in 1996 the state of California launched a much larger, state- 
wide class-size reduction effort for students in grades K-3. But to implement this effort, California 
schools hired 25,000 new K-3 teachers, many with low qualifications. Thus the proportion of fully- 
credentialed K-3 teachers fell in most California schools, with the largest drop (16%) occurring in 
the schools serving the lowest-income students. By contrast, all the teachers in the Tennessee 
study were fully qualified. This difference in implementation may account for the fact that, accord- 
ing to preliminary comparison-group data, class-size reduction in California may not be having as 
large an impact as in Tennessee.

Example. Three well-designed randomized controlled trials have established the effectiveness of the 
Nurse-Family Partnership - a nurse visitation program provided to low-income, mostly single women dur- 
ing pregnancy and their children’s infancy. One of these studies included a 15-year follow-up, which 
found that the program reduced the children’s arrests, convictions, number of sexual partners, and
alcohol use by 50-80 percent.

Fidelity of implementation appears to be extremely important for this program. Specifically, one 
of the randomized controlled trials of the program showed that when the home visits are carried out 
by paraprofessionals rather than nurses - holding all other details the same - the program is only 
marginally effective. Furthermore, a number of other home visitation programs for low-income 
families, designed for different purposes and using different protocols, have been shown in randomized
controlled trials to be ineffective.

B. When implementing an evidence-based intervention, it may be important to collect outcome data to check
whether its effects in your schools differ greatly from what the evidence predicts.

Collecting outcome data is important because it is always possible that slight differences in imple- 
mentation or setting between your schools or classrooms and those in the studies could lead to 
substantially different outcomes. So, for example, if you implement an evidence-based reading
program in a particular group of schools or classrooms, you may wish to identify a comparison group
of schools or classrooms, roughly matched in reading skills and demographic characteristics, that
is not using the program. Tracking reading test scores for the two
groups over time, while perhaps not fully meeting the guidelines for “possible” evidence described 
above, may still give you a sense of whether the program is having effects that are markedly differ- 
ent from what the evidence predicts. 
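
A minimal sketch of such tracking, in Python, compares the average score gain in the classrooms using the program with the gain in the comparison classrooms. The scores and function names below are hypothetical, and a comparison of this kind gives only a rough signal; it does not by itself meet the guidelines for “possible” evidence.

```python
from statistics import mean


def average_gain(before, after):
    """Mean change in score for one group of classrooms."""
    return mean(a - b for b, a in zip(before, after))


def gain_difference(program_before, program_after,
                    comparison_before, comparison_after):
    """Program-group gain minus comparison-group gain."""
    return (average_gain(program_before, program_after)
            - average_gain(comparison_before, comparison_after))


# Hypothetical classroom-average reading scores (invented for illustration).
program_fall, program_spring = [50, 52, 48, 51], [58, 59, 55, 60]
comparison_fall, comparison_spring = [51, 50, 49, 52], [54, 53, 52, 56]

if __name__ == "__main__":
    diff = gain_difference(program_fall, program_spring,
                           comparison_fall, comparison_spring)
    print(f"program gain exceeds comparison gain by {diff:.2f} points")
```

If the difference in gains is far smaller than the studies of the program would lead you to expect, that may be a prompt to re-examine how the program is being implemented.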

Appendix A: Where to find evidence-based interventions 

The following web sites can be useful in finding evidence-based educational interventions. These 
sites use varying criteria for determining which interventions are supported by evidence, but all 
distinguish between randomized controlled trials and other types of supporting evidence. We rec- 
ommend that, in navigating these web sites, you use this Guide to help you make independent 
judgments about whether the listed interventions are supported by “strong” evidence, “possible”
evidence, or neither. 

The What Works Clearinghouse (http://www.w-w-c.org/) was established by the U.S. Department of
Education’s Institute of Education Sciences to provide educators, policymakers, and the public with 
a central, independent, and trusted source of scientific evidence of what works in education. 

The Promising Practices Network (http://www.promisingpractices.net/) web site highlights programs
and practices that credible research indicates are effective in improving outcomes for children,
youth, and families.

Blueprints for Violence Prevention (http://www.colorado.edu/cspv/blueprints/index.html) is a
national violence prevention initiative to identify programs that are effective in reducing adoles- 
cent violent crime, aggression, delinquency, and substance abuse. 

The International Campbell Collaboration (http://www.campbellcollaboration.org/Fralibrary.html)
offers a registry of systematic reviews of evidence on the effects of interventions in the social, 
behavioral, and educational arenas. 

Social Programs That Work (http://www.excelgov.org/displayContent.asp?Keyword=prppcSocial)
offers a series of papers developed by the Coalition for Evidence-Based Policy on social programs
that are backed by rigorous evidence of effectiveness. 

Appendix B: Checklist to use in evaluating whether an
intervention is backed by rigorous evidence 

Step 1. Is the intervention supported by “strong” evidence of effectiveness? 

A. The quality of evidence needed to establish “strong” evidence: randomized controlled trials that are
well-designed and implemented. The following are key items to look for in assessing whether a trial is
well-designed and implemented.

Key items to look for in the study’s description of the intervention and the random assign- 
ment process 

The study should clearly describe the intervention, including: (i) who administered it, who received 
it, and what it cost; (ii) how the intervention differed from what the control group received; and 
(iii) the logic of how the intervention is supposed to affect outcomes (p. 5). 

Be alert to any indication that the random assignment process may have been compromised (pp. 5-6).

The study should provide data showing that there are no systematic differences between the inter- 
vention and control groups prior to the intervention (p. 6). 

Key items to look for in the study’s collection of outcome data 

The study should use outcome measures that are “valid” - i.e., that accurately measure the true 
outcomes that the intervention is designed to affect (pp. 6-7). 

The percent of study participants that the study has lost track of when collecting outcome data 
should be small, and should not differ between the intervention and control groups (p. 7). 

The study should collect and report outcome data even for those members of the intervention group 
who do not participate in or complete the intervention (p. 7). 

The study should preferably obtain data on long-term outcomes of the intervention, so that you can 
judge whether the intervention’s effects were sustained over time (pp. 7-8). 

Key items to look for in the study’s reporting of results 

If the study makes a claim that the intervention is effective, it should report (i) the size of the 
effect, and (ii) statistical tests showing the effect is unlikely to be the result of chance (pp. 8-9).

A study’s claim that the intervention’s effect on a subgroup (e.g., Hispanic students) is different
from its effect on the overall population in the study should be treated with caution (p. 9).

The study should report the intervention’s effects on all the outcomes that the study measured, not 
just those for which there is a positive effect (p. 9).

B. Quantity of evidence needed to establish “strong” evidence of effectiveness (p. 10).

The intervention should be demonstrated effective, through well-designed randomized controlled 
trials, in more than one site of implementation; 

These sites should be typical school or community settings, such as public school classrooms taught 
by regular teachers; and 

The trials should demonstrate the intervention’s effectiveness in school settings similar to yours,
before you can be confident it will work in your schools/classrooms.

Step 2. If the intervention is not supported by “strong” evidence, is it nevertheless supported 
by “possible” evidence of effectiveness? 

This is a judgment call that depends, for example, on the extent of the flaws in the randomized 
trials of the intervention and the quality of any nonrandomized studies that have been done. The 
following are a few factors to consider in making these judgments. 

A. Circumstances in which a comparison-group study can constitute “possible” evidence: 

The study’s intervention and comparison groups should be very closely matched in academic achieve- 
ment levels, demographics, and other characteristics prior to the intervention (pp. 11-12). 

The comparison group should not be comprised of individuals who had the option to participate in 
the intervention but declined (p. 12). 

The study should preferably choose the intervention/comparison groups and outcome measures
“prospectively” - i.e., before the intervention is administered (p. 12). 

The study should meet the checklist items listed above for a well-designed randomized controlled 
trial (other than the item concerning the random assignment process). That is, the study should 
use valid outcome measures, report tests for statistical significance, and so on (pp. 16-17).

Studies that do not meet the threshold for “possible” evidence of effectiveness include: (i) pre-post
studies (p. 2); (ii) comparison-group studies in which the intervention and comparison groups are 
not well-matched; and (iii) “meta-analyses” that combine the results of individual studies which 
do not themselves meet the threshold for “possible” evidence (p. 13). 

Step 3. If the intervention is backed by neither “strong” nor “possible” evidence, one may 
conclude that it is not supported by meaningful evidence of effectiveness. 

Address correspondence regarding this article to Jon Baron, Coalition for Evidence-Based Policy, 1301 K Street, NW,
Washington, DC 20005, or visit the Coalition on the web at www.excelgov.org/evidence. Editor’s note: Reprinted with
permission.